Q learning with finite trials
Author
Abstract
The standard reinforcement learning model is powerful enough to deal with never-ending trials. By slightly discounting rewards obtained in the future, an infinite walk through the environment is still guaranteed to have a finite expected future reward. This, however, comes at a price: discounting may corrupt estimates of the expected return in trials that do end. Moreover, most algorithms that can deal with continuing trials rely heavily on arbitrary initial estimates of the optimal Q-function Q*. If these estimates are off, convergence can be slowed down considerably. In a terminal state the expected future return is 0, which is known exactly after the first visit, so we can expect to improve the Q-learning algorithm by anticipating terminal states. This thesis investigates MDPs in which discounting is unnecessary. New work is presented for the deterministic case. In deterministic MDPs, the set of undiscounted Bellman equations for Q* has a unique solution if and only if from every state there is a path to a terminal state and all cycles have negative weight. For this class of MDPs, the Q-learning algorithm can be improved by altering it so that the initial estimates are never used. If the agent trains and explores in a fair way, that is, if it never fully ignores a state-action pair, this improved Q-learning algorithm is proven to converge in finite time, and the proof gives a worst-case performance guarantee. To gain insight into the average-case performance and the effect of the exploration strategy, several experiments have been conducted in the simple three-in-a-row game of tic-tac-toe. New performance measures, based on the (precomputed) optimal table Q*, are introduced to study the behavior of the algorithm more accurately.
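To make the uniqueness claim concrete, here is one way to write the undiscounted Bellman optimality equations for the deterministic case. The notation (successor function δ, reward r, terminal set T) is an assumption introduced here, not taken from the thesis:

```latex
% Undiscounted Bellman optimality equations for a deterministic MDP.
% Assumed notation: \delta(s,a) is the unique successor state,
% r(s,a) the immediate reward, T the set of terminal states.
\[
Q^{*}(s,a) =
\begin{cases}
  r(s,a) & \text{if } \delta(s,a) \in T,\\[2pt]
  r(s,a) + \max\limits_{a'} Q^{*}\bigl(\delta(s,a),\,a'\bigr) & \text{otherwise.}
\end{cases}
\]
```

Under the abstract's two conditions (every state can reach a terminal state, and every cycle has negative total weight), cycling forever only loses value, so an optimal walk must eventually terminate; this is what makes the system uniquely solvable without a discount factor.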
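The key improvement, never relying on arbitrary initial Q estimates, can be sketched roughly as follows. This is a minimal illustration, not the thesis's exact algorithm: the environment interface (env.reset, env.step, env.actions) and the bookkeeping structures Q and known are hypothetical names introduced for this sketch.

```python
import random
from collections import defaultdict

def train_episode(env, Q, known, epsilon=0.1):
    """Run one trial of undiscounted Q-learning in a deterministic MDP.

    A sketch under assumptions: env.reset()/env.step(a)/env.actions(s)
    are a hypothetical interface; Q maps (state, action) pairs to
    values, and known[s] holds the actions of s whose Q entry has been
    set, so arbitrary initial estimates are never consulted.
    """
    s = env.reset()
    done = False
    while not done:
        # Fair exploration: every action keeps a nonzero probability.
        if random.random() < epsilon or not known[s]:
            a = random.choice(env.actions(s))
        else:
            a = max(known[s], key=lambda act: Q[(s, act)])
        s2, r, done = env.step(a)
        if done:
            # Terminal successor: the future return is exactly 0, so
            # this entry is known exactly after the first visit.
            Q[(s, a)] = r
            known[s].add(a)
        elif known[s2]:
            # Deterministic transition: a learning rate of 1 backs up
            # an exact value from already-known successor entries.
            Q[(s, a)] = r + max(Q[(s2, act)] for act in known[s2])
            known[s].add(a)
        s = s2

# Usage sketch: Q = {}, known = defaultdict(set), then call
# train_episode(env, Q, known) repeatedly on a deterministic env.
```

Because values only propagate backward from exactly-known terminal entries, no arbitrary initial estimate ever enters a backup; fair exploration ensures every state-action pair is eventually reached, matching the convergence condition stated in the abstract.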
Similar resources
Reinforcement learning based feedback control of tumor growth by limiting maximum chemo-drug dose using fuzzy logic
In this paper, a model-free reinforcement-learning-based controller is designed to extract a treatment protocol, since the highly nonlinear dynamics of cancer make a model-based controller complex to design. The Q-learning algorithm is used to develop an optimal controller for cancer chemotherapy drug dosing. In the Q-learning algorithm, each entry of the Q-table is updated using data...
A Smoothed Q-Learning Algorithm for Estimating Optimal Dynamic Treatment Regimes
In this paper we propose a smoothed Q-learning algorithm for estimating optimal dynamic treatment regimes. In contrast to the Q-learning algorithm in which non-regular inference is involved, we show that under assumptions adopted in this paper, the proposed smoothed Q-learning estimator is asymptotically normally distributed even when the Q-learning estimator is not and its asymptotic variance ...
Quasirecognition by the prime graph of L_3(q) where 3 < q < 100
Let $G$ be a finite group. We construct the prime graph of $G$, denoted $\Gamma(G)$, as follows: the vertex set of this graph is the set of prime divisors of $|G|$, and two distinct vertices $p$ and $q$ are joined by an edge if and only if $G$ contains an element of order $pq$. In this paper, we determine finite groups $G$ with $\Gamma(G) = \Gamma(L_3(q))$, $2 \leq q < 100$, and prov...
Evaluating project’s completion time with Q-learning
Nowadays, project management is a key component of introductory operations management. Educators and researchers in these areas advocate representing a project as a network and applying the solution approaches for network models to assist project managers in monitoring completion. In this paper, we evaluated a project’s completion time using the Q-learning algorithm. So the ...
Reinforcement Learning in Finite MDPs: PAC Analysis
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. We also present a...
Journal title:
Volume/Issue:
Pages: -
Publication date: 1999